Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

Authors

  • John G. McMahon
  • Francis Jack Smith
Abstract

An automatic word-classification system has been designed that uses word unigram and bigram frequency statistics to implement a binary top-down form of word clustering and employs an average class mutual information metric. Words are represented as structural tags: n-bit numbers whose most significant bit patterns incorporate class information. The classification system has revealed some of the lexical structure of English, as well as some phonemic and semantic structure. The system has been compared, directly and indirectly, with other recent word-classification systems. We see our classification as a means towards the end of constructing multilevel class-based interpolated language models. We have built some of these models and carried out experiments that show a 7% drop in test set perplexity compared to a standard interpolated trigram language model.
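As a rough illustration of the two ideas named in the abstract, the following Python sketch computes an average class mutual information score for a given word-to-class assignment from bigram counts, and reads a class at a chosen depth from the most significant bits of an n-bit tag. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-based data layout, and the toy counts are all hypothetical, and the top-down search that chooses the assignment is not shown.

```python
import math
from collections import Counter


def average_class_mutual_information(bigram_counts, word_to_class):
    """Mutual information (in bits) between the classes of adjacent words.

    bigram_counts: dict mapping (w1, w2) -> count      (hypothetical layout)
    word_to_class: dict mapping word -> class label    (hypothetical layout)
    """
    pair = Counter()   # counts of (class of w1, class of w2)
    left = Counter()   # counts of the class in first position
    right = Counter()  # counts of the class in second position
    total = 0
    for (w1, w2), n in bigram_counts.items():
        c1, c2 = word_to_class[w1], word_to_class[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
        total += n

    mi = 0.0
    for (c1, c2), n in pair.items():
        p12 = n / total
        # P(c1, c2) / (P(c1) * P(c2)) simplifies to n * total / (left * right)
        mi += p12 * math.log2(n * total / (left[c1] * right[c2]))
    return mi


def class_prefix(tag, n_bits, depth):
    """Return the top `depth` bits of an n-bit structural tag."""
    return tag >> (n_bits - depth)


# Toy data (invented for illustration only).
toy_bigrams = {("the", "cat"): 1, ("a", "dog"): 1,
               ("cat", "the"): 1, ("dog", "a"): 1}
split = {"the": 0, "a": 0, "cat": 1, "dog": 1}    # determiners vs. nouns
merged = {"the": 0, "a": 0, "cat": 0, "dog": 0}   # everything in one class

score_split = average_class_mutual_information(toy_bigrams, split)
score_merged = average_class_mutual_information(toy_bigrams, merged)
```

On this toy data the two-class split scores higher than the single-class assignment, which is the kind of comparison a clustering procedure driven by this metric would rely on when deciding how to partition words.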


Similar resources

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

This paper describes our research on statistical language modeling of Lithuanian. The idea of improving sparse n-gram models of highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition was investigated. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated c...


Ensemble methods for offline handwritten text line recognition

This thesis investigates ensemble methods for offline recognition of English handwritten text lines. Multiple recognisers are automatically generated from a single base recognition system. Combining the output of these multiple recognisers provides the final ensemble result. The underlying recognisers are based on hidden Markov models. One model is built for each character. Based on the lexicon...


Automatic Reordering Rule Generation and Application of Reordering Rules in Stochastic Reordering Model for English-Myanmar Machine Translation

Reordering is one of the most challenging and important problems in Statistical Machine Translation. Without reordering capabilities, sentences can be translated correctly only in case when both languages implied in translation have a similar word order. When translating is between language pairs with high disparity in word order, word reordering is extremely desirable for translation accuracy ...


Extraction of Hierarchies Based on Inclusion of Co-occurring Words with Frequency Information

In this paper, we propose a method of automatically extracting word hierarchies based on the inclusion relations of word appearance patterns in corpora. We applied the complementary similarity measure (CSM) to determine a hierarchical structure of word meanings. The CSM is a similarity measure developed for recognizing degraded machine-printed text. There are CSMs for both binary and gray-scale...


Unsupervised Query Segmentation Using Monolingual Word Alignment Method

In this paper, we propose a novel unsupervised approach to query segmentation using the word alignment model which is usually adopted in statistical machine translation system. Query segmentation is to obtain complete phrases or concepts in a query by segmenting a sequence of query terms, which is an important query processing procedure for improving information retrieval performance in search ...



Journal:
  • Computational Linguistics

Volume: 22

Publication date: 1996